Abstract


This paper introduces multilevel extensions for the general diagnostic model (GDM) following 
recent developments on extensions of latent class analysis (LCA) to hierarchical models. The GDM 
is based on LCA as well as discrete latent trait models and may be viewed as a general modeling 
framework for confirmatory multidimensional item response models. 

The multilevel extensions presented in this paper enable one to check the impact of clustered 
data, such as data for students within schools in large scale educational surveys, on the structural 
parameter estimates of the GDM. Moreover, the multilevel version of the GDM allows study of 
differences in skill distributions across these clusters. 


Key words: Latent class analysis, multilevel extensions, item response models, diagnostic models, 
logistic models

1 Introduction


This paper introduces a hierarchical extension of the general diagnostic model (GDM; von 
Davier, in press-a) similar to extensions for latent class analysis (LCA; Lazarsfeld & Henry, 1968) 
to multilevel latent class models (Vermunt, 2003). Hierarchical extensions have also been developed 
for linear models (e.g., Bryk & Raudenbush, 1992; Goldstein, 1987) as well as for Rasch-type 
models (e.g., Kamata & Cheong, 2006) and more general IRT models (e.g., Fox & Glas, 2001). 

The GDM is based on LCA as well as discrete latent trait models (Heinen, 1996) and may be 
viewed as a general modeling framework for confirmatory multidimensional item response models 
(see von Davier, in press-b; von Davier & Rost, 2006; von Davier & Yamamoto, 2007). 

The multilevel extensions presented in this paper enable one to check the impact of the 
clustering of observed data, such as data for students within schools in large scale educational 
surveys, on the structural parameter estimates of the GDM. Moreover, the multilevel version of 
the GDM allows the study of differences in skill distributions across these clusters. 

2 The General Diagnostic Model 

Assume an $I$-dimensional categorical random variable $x = (x_1, \ldots, x_I)$ with
$x_i \in \{0, \ldots, m_i\}$ for $i \in \{1, \ldots, I\}$, referred to as a response vector in the
following. Further assume that there are $N$ independent and identically distributed (i.i.d.)
realizations $x_1, \ldots, x_N$ of this random variable $x$, so that $x_{ni}$ denotes the $i$-th
component of the $n$-th realization $x_n$. In addition, assume that there are $N$ unobserved
realizations of a $K$-dimensional categorical variable, $a = (a_1, \ldots, a_K)$, so that the vector

$$(x_n, a_n) = (x_{n1}, \ldots, x_{nI}, a_{n1}, \ldots, a_{nK})$$

exists for all $n \in \{1, \ldots, N\}$. The data structure

$$(X, A) = ((x_n, a_n))_{n=1,\ldots,N}$$

is referred to as the complete data, and $(x_n)_{n=1,\ldots,N}$ is referred to as the observed data
matrix. Denote $(a_n)_{n=1,\ldots,N}$ as the latent skill or attribute patterns, which are the
unobserved target of inference.

Let $P(a) = P(A = (a_1, \ldots, a_K)) > 0$ for all $a$ denote the nonvanishing discrete count density
of $a$. Assume that the conditional discrete count density $P(x_1, \ldots, x_I \mid a)$ exists for
all $a$. Then the probability of a response vector $x$ can be written as

$$P(x) = \sum_{a} P(a) P(x_1, \ldots, x_I \mid a).$$

2.1 Conditional Independence 

So far, no assumptions have been made about the specific form of the conditional distribution
of $x$ given $a$, other than that $P(x_1, \ldots, x_I \mid a)$ exists. For the general diagnostic
model, local independence (LI) of the components $x_i$ given $a$ is assumed, which yields

$$P(x_1, \ldots, x_I \mid a) = \prod_{i=1}^{I} p_i(x = x_i \mid a)$$

so that the probability $p_i(x = x_i \mid a)$ is the one component left to be specified to arrive at
a model for $P(x)$.

2.2 Logistic Model Specification 

Logistic models have widespread applications and, apart from early disputes about the merits
of probit versus logit models (Berkson as cited in Cramer, 2003), have secured a prominent position
among models for categorical data. The general diagnostic model is also specified as a model with a
logistic link between an argument, which depends on the random variables involved and some
real-valued parameters, and the probability of the observed response.

Using the above definitions, the GDM is defined as follows. Let

$$Q = (q_{ik}), \quad i = 1, \ldots, I, \; k = 1, \ldots, K$$

be a binary $I \times K$ matrix, that is, $q_{ik} \in \{0, 1\}$. Let

$$(\gamma_{ikx}), \quad i = 1, \ldots, I, \; k = 1, \ldots, K, \; x = 1, \ldots, m_i$$

be a cube of real-valued parameters, and let $\beta_{ix}$ for $i = 1, \ldots, I$ and
$x \in \{0, \ldots, m_i\}$ be real-valued parameters. Then define

$$p_i(x \mid a) = \frac{\exp\left(\beta_{ix} + \sum_{k} \gamma_{ikx} h(q_{ik}, a_k)\right)}{\sum_{y=0}^{m_i} \exp\left(\beta_{iy} + \sum_{k} \gamma_{iky} h(q_{ik}, a_k)\right)}.$$

It is often convenient to constrain the $\gamma_{ikx}$ somewhat and to specify the real-valued
function $h(q_{ik}, a_k)$ and the $a_k$ in a way that allows emulation of models frequently used in
educational measurement and psychometrics. It is convenient to choose $h(q_{ik}, a_k) = q_{ik} a_k$
and $\gamma_{ikx} = x \gamma_{ik}$, which defines the general diagnostic model for partial credit
data (Muraki, 1992).
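As a minimal sketch of the partial credit form, the following Python function evaluates the category probabilities of a single item for a given skill pattern, using $h(q_{ik}, a_k) = q_{ik} a_k$ and $\gamma_{ikx} = x \gamma_{ik}$. The function name, the NumPy dependency, and all numerical values are illustrative assumptions; they are not taken from the mdltm software. The intercept of category 0 is set to zero by the caller.

```python
import numpy as np

def gdm_category_probs(beta, gamma, q, a):
    """Return p_i(x | a) for x = 0, ..., m_i under the partial credit GDM.

    beta  : intercepts beta_{ix}, length m_i + 1
    gamma : slopes gamma_{ik}, length K
    q     : row i of the Q-matrix (0/1 entries), length K
    a     : skill pattern, length K
    """
    beta = np.asarray(beta, dtype=float)
    x = np.arange(beta.shape[0])                      # categories 0..m_i
    # sum_k gamma_ik * q_ik * a_k, i.e. h(q_ik, a_k) = q_ik * a_k
    skill_term = float(np.dot(np.asarray(gamma, dtype=float) * np.asarray(q), a))
    logits = beta + x * skill_term                    # gamma_ikx = x * gamma_ik
    logits = logits - logits.max()                    # numerical stability only
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical binary item that requires skills 1 and 2 out of K = 3.
beta = [0.0, -1.0]
gamma = [1.5, 0.8, 0.0]
q = [1, 1, 0]
p_master = gdm_category_probs(beta, gamma, q, a=[1, 1, 0])
p_nonmaster = gdm_category_probs(beta, gamma, q, a=[0, 0, 1])
```

For a binary item the function reduces to a logistic curve in the required skills. Skill 3 does not change the result here because its Q-matrix entry is 0.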

Von Davier (2005) has shown that this model already contains several models from the areas 
of item response theory (IRT; Lord & Novick, 1968), latent-class analysis (Lazarsfeld & Henry, 
1968), multiple classification latent-class models (Goodman, 1974; Haberman, 1979; Maris, 1999) 
and diagnostic models (see, for example, von Davier, DiBello, & Yamamoto, 2006). 

3 Mixture General Diagnostic Models 

Von Davier (in press-b) introduced the discrete mixture distribution version of the GDM,
referred to as the MGDM. In discrete mixture models for item response data (Mislevy & Verhelst,
1990; Rost, 1990; for an overview, see von Davier & Rost, 2006), the probability of an observation
$x$ depends on the unobserved latent trait (in the case of the GDMs, the skill pattern $a$) and on a
subpopulation indicator $g$, which is also unobserved. The rationale for mixture distribution models
is that observations from different subpopulations may either differ in their distribution of skills
or in their approach to the items (e.g., in terms of strategies employed) or both. A discrete mixture
distribution in the setup of random variables as introduced above includes an unobserved grouping
indicator $g_n$ for $n = 1, \ldots, N$. The complete data for examinee $n$ then becomes
$(x_n, a_n, g_n)$, of which only $x_n$ is observed in mixture distribution models. In multiple group
models, $(x_n, g_n)$ is observed.

The conditional independence assumption has to be modified to account for differences
between groups, that is

$$P(x \mid a, g) = P(x_1, \ldots, x_I \mid a, g) = \prod_{i=1}^{I} p_i(x = x_i \mid a, g).$$

Moreover, assume that the conditional probability of the components $x_i$ of $x$ depends on nothing
but $a$ and $g$, that is,

$$P(x \mid a, g, z) = \prod_{i=1}^{I} p_i(x = x_i \mid a, g) = P(x \mid a, g) \tag{1}$$

for any random variable $z$. In mixture models, when the $g_n$ are not observed, the marginal
probability of a response vector $x$ needs to be found, that is,

$$P(x) = \sum_{g} \pi_g P(x \mid g), \tag{2}$$

where $P(x \mid g) = \sum_{a} p(a \mid g) P(x \mid a, g)$. The $\pi_g = P(G = g)$ are referred to as
mixing proportions, or class sizes. The class-specific probability of a response vector $x$ given
skill pattern $a$ in class $g$ is then

$$P(x \mid a, g) = \prod_{i=1}^{I} p_i(x_i \mid a, g) = \prod_{i=1}^{I} \frac{\exp\left(\beta_{i x_i g} + \sum_{k} x_i \gamma_{ikg} q_{ik} a_k\right)}{1 + \sum_{y=1}^{m_i} \exp\left(\beta_{iyg} + \sum_{k} y \gamma_{ikg} q_{ik} a_k\right)} \tag{3}$$

with class-specific item difficulties $\beta_{ixg}$. The $\gamma_{ikg}$ are the slope parameters
relating skill $k$ to item $i$ in class $g$.

Note that mixture models and multiple group models are two extremes: for mixture models
no $g_n$ is observed, while for multiple group models all $g_n$ are observed. Von Davier and Yamamoto
(2004) pointed this out and described an extension of the generalized partial credit model (GPCM)
for mixture versions, multiple group versions, and partially observed grouping versions, where the
$g_n$ information is missing only for a portion of the sample.

One important special case of the MGDM is a model that assumes measurement invariance
across populations, which is expressed in the equality of $p(x \mid a, g)$ across groups, or, more
formally:

$$p_i(x_i \mid a, g) = p_i(x_i \mid a, c) \quad \text{for all } i \in \{1, \ldots, I\} \text{ and all } g, c \in \{1, \ldots, G\}.$$

This assumption allows one to write the model equation without the group index $g$ in the
conditional response probabilities, so that

$$P(x) = \sum_{g} \pi_g P(x \mid g) = \sum_{g} \pi_g \sum_{a} p(a \mid g) \prod_{i=1}^{I} p_i(x_i \mid a). \tag{4}$$

Note that the differences between groups are only present in the $p(a \mid g)$, so that the skill
distribution is the only component with a condition on $g$ in the above equation. The next section
introduces hierarchical GDMs based on mixture distribution versions of the GDM.
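The marginal probability in Equation 4 can be sketched as a plain triple loop over groups, skill patterns, and items. The function below is an illustration under stated assumptions: the function and argument names are hypothetical, and the toy model (two binary items, one binary skill, two groups) uses made-up numbers.

```python
import itertools
import numpy as np

def mixture_pattern_prob(x, item_probs, skill_dist, pi):
    """P(x) = sum_g pi_g sum_a p(a | g) prod_i p_i(x_i | a)   (Equation 4).

    x          : observed responses, length I
    item_probs : callable (i, a) -> array of category probabilities of item i
    skill_dist : dict g -> dict {skill pattern tuple: p(a | g)}
    pi         : dict g -> mixing proportion pi_g
    """
    total = 0.0
    for g, p_a_given_g in skill_dist.items():
        for a, p_a in p_a_given_g.items():
            lik = 1.0
            for i, xi in enumerate(x):          # local independence given a
                lik *= item_probs(i, a)[xi]
            total += pi[g] * p_a * lik
    return total

# Hypothetical toy model: two binary items, one binary skill, two groups.
p_correct = [[0.2, 0.8], [0.3, 0.9]]            # p_correct[i][a_1]

def item_probs(i, a):
    p = p_correct[i][a[0]]
    return np.array([1.0 - p, p])

skill_dist = {1: {(0,): 0.7, (1,): 0.3}, 2: {(0,): 0.2, (1,): 0.8}}
pi = {1: 0.5, 2: 0.5}

p_11 = mixture_pattern_prob((1, 1), item_probs, skill_dist, pi)
total = sum(mixture_pattern_prob(x, item_probs, skill_dist, pi)
            for x in itertools.product([0, 1], repeat=2))
```

Because the item probabilities do not depend on $g$, the groups differ only through `skill_dist`, which mirrors the measurement invariance assumption. Summing over all response patterns returns 1, which is a quick sanity check on any implementation.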


4 Hierarchical General Diagnostic Models 

Hierarchical models introduce an additional structure, often referred to as a cluster variable,
in the modeling of observed variables to account for correlations in the data. These are attributed
to the complex structure of the environment in which the data are observed. More concretely,
one standard example of clustered data is the responses to educational assessments sampled
from students within schools or classrooms. As a rather sloppy explanation, it seems plausible to
assume that students within schools are more similar than students across schools (even though
the extent to which this statement is true may depend on the educational system). Hierarchical
models have been developed for linear models (e.g., Bryk & Raudenbush, 1992; Goldstein, 1987)
as well as for Rasch-type models (e.g., Kamata & Cheong, 2006).

For the developments presented here, the extension of the LCA to a hierarchical model (e.g.,
Vermunt, 2003, 2004) is of importance. In addition to the latent class or grouping variable $g$, the
hierarchical extension of the LCA assumes that each observation $n$ is characterized by an outcome
$s_n$ on a clustering variable $s$. The clusters identified by this outcome may be schools, classrooms,
or other sampling units representing the hierarchical structure of the data collection. As Vermunt
outlined, the (unobserved) group membership $g_n$ is thought of as an individual classification
variable; for two examinees $n \neq m$ there may be two different group memberships, that is, both
$g_n = g_m$ and $g_n \neq g_m$ are permissible even if they belong to the same cluster (i.e., $s_n = s_m$).

Moreover, it is assumed that the skill distribution depends only on the group indicator $g$ and
no other variable, that is,

$$P(a \mid g, z) = P(a \mid g) \tag{5}$$

for any random variable $z$. More specifically, for the clustering variable $s$,

$$P(g) = \sum_{s=1}^{S} p(s) p(g \mid s).$$

With Equation 5,

$$P(a \mid s) = \sum_{g} P(g \mid s) P(a \mid g),$$

for

$$P(g \mid s) P(a \mid g) = p(g \mid s) P(a \mid g, s) = P(a, g \mid s).$$

As above for the MGDM, assume that the observed responses $x$ depend on the skill pattern $a$
and the group index $g$ only. Then

$$P(x \mid g, s) = \sum_{a} p(a \mid g, s) P(x \mid a, g, s) = \sum_{a} p(a \mid g) P(x \mid a, g) = P(x \mid g)$$

with Equations 1 and 5. Then the marginal distribution of a response pattern $x$ in the hierarchical
GDM (HGDM) is given by

$$P(x) = \sum_{s} p(s) \sum_{g} P(g \mid s) \sum_{a} P(a \mid g) P(x \mid a, g), \tag{6}$$

where, as before in the MGDM, the $p(a \mid g)$ denote the distribution of the skill patterns in group
$g$, and the $p(x \mid a, g)$ denote the distribution of the response vector $x$ conditional on skill
pattern $a$ and group $g$. A hierarchical GDM that assumes measurement invariance across clusters and
across groups is defined by

$$P(x) = \sum_{s} p(s) \sum_{g} P(g \mid s) \sum_{a} P(a \mid g) P(x \mid a), \tag{7}$$

with conditional response probabilities $p(x \mid a) = \prod_{i} p_i(x_i \mid a)$ that do not depend on
cluster or group variables.
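Since every sum in Equation 7 runs over a finite index set, the marginal probability of one response pattern can be written as a chain of matrix products. The sketch below assumes NumPy and uses hypothetical numbers for three clusters, two groups, and four skill patterns; `p_x_given_a` stands for $P(x \mid a)$ evaluated at a single fixed response pattern $x$.

```python
import numpy as np

# All values are made up for illustration.
p_s = np.array([0.5, 0.3, 0.2])                       # p(s), S = 3 clusters
p_g_given_s = np.array([[0.8, 0.2],                   # p(g | s), rows sum to 1
                        [0.5, 0.5],
                        [0.1, 0.9]])
p_a_given_g = np.array([[0.60, 0.20, 0.15, 0.05],     # p(a | g), rows sum to 1
                        [0.05, 0.15, 0.20, 0.60]])
p_x_given_a = np.array([0.02, 0.10, 0.12, 0.45])      # P(x | a) for one fixed x

# Cluster-specific skill distributions: P(a | s) = sum_g P(g | s) P(a | g)
p_a_given_s = p_g_given_s @ p_a_given_g

# Equation 7: P(x) = sum_s p(s) sum_g P(g | s) sum_a P(a | g) P(x | a)
p_x = p_s @ p_a_given_s @ p_x_given_a
```

The intermediate matrix `p_a_given_s` is the quantity of substantive interest in the hierarchical model, because it shows how the skill distribution differs between clusters even though only `p_g_given_s` carries a cluster index.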

The increase in complexity of hierarchical GDMs over nonhierarchical versions lies in the fact
that the group distribution $P(g \mid s)$ depends on the cluster variable $s$. This increases the
number of group or class size parameters depending on the number of clusters $\#\{s : s \in S\}$. The
estimation of the item parameters $\beta_{ix(g)}$ and $\gamma_{ik(g)}$ as well as the additional
conditional probabilities of group sizes given clusters $p(g \mid s)$ and skill patterns given group
$P(a \mid g)$ is outlined in the next section.

5 Estimation of Hierarchical General Diagnostic Models 

The case of fitting models with cluster-dependent response probabilities $P(x \mid a, s)$ will not be
discussed here. The reason is that a model in which both the skill distributions and the probability
of correct responses depend on the cluster variable does not allow attribution of the variation
of observed responses across clusters to differences in skill distributions. Such a model would
essentially assume that items have different difficulty in different clusters. Even though this is a
very empathic view of the world, it does not allow drawing any conclusions about cluster
differences other than that the clusters are different. Apart from that, the fact that most
applications of hierarchical models offer only moderate sample sizes within clusters makes the
estimation of a multitude of cluster-specific parameters infeasible.

The estimation of GDMs and MGDMs has been outlined in von Davier (in press-a, in press-b).
This approach is extended here to the estimation of HGDMs. The expectation-maximization (EM)
algorithm has been shown to be suitable for this kind of estimation problem (Vermunt,
2003), so that other, more computationally costly methods are not necessary. For the most part,
researchers will be concerned with fitting less highly parameterized versions of the HGDM, such as
the models given in Equations 6 and 7.

The mdltm software (von Davier, 2005) enables one to estimate GDMs and MGDMs as well as HGDMs
according to Equations 6 and 7. The extensions to enable estimation of these models were recently
implemented in mdltm based on the research presented in this paper.

Since the data are structured hierarchically, the first step is to define the complete data for
the case of the HGDM. Let $S$ denote the number of clusters in the sample, and let $N_s$ denote the
number of examinees in cluster $s$, for $s = 1, \ldots, S$. Then

• let $x_{ins}$ denote the $i$-th response of the $n$-th examinee in cluster $s$ and let $x_{ns}$ denote
the complete observed response vector of examinee $n$ in cluster $s$

• let $a_{kns}$ denote the $k$-th skill of examinee $n$ in cluster $s$ and let $a_{ns}$ denote the skill
pattern of examinee $n$ in cluster $s$

• let $g_{ns}$ denote the group membership of examinee $n$ in cluster $s$

Note that only the $x_{ins}$ are observed, as are the cluster sizes $N_s$ and the number of clusters
$S$. The $a_{kns}$ and $g_{ns}$ are unobserved and have to be inferred by making model assumptions and
calculating posterior probabilities such as $P(g \mid s)$ and $P(a, g \mid x, s)$.

5.1 Marginal Calculations in Hierarchical General Diagnostic Models 

For the complete data (i.e., the observed data $x$ in conjunction with the unobserved skill
profiles $a$ and group membership $g$), the marginal likelihood is

$$L = \prod_{s=1}^{S} \prod_{n=1}^{N_s} P(x_{ns}, a_{ns}, g_{ns}; s),$$

that is, a product over cluster-specific distributions of the complete data. With the above
assumptions,

$$L = \prod_{s=1}^{S} \prod_{n=1}^{N_s} P(x_{ns} \mid a_{ns}, g_{ns}) P(a_{ns} \mid g_{ns}) P(g_{ns} \mid s),$$

which equals

$$L_x \times L_a \times L_g,$$

with

$$L_x \times L_a \times L_g = \left(\prod_{s=1}^{S} \prod_{n=1}^{N_s} P(x_{ns} \mid a_{ns}, g_{ns})\right) \left(\prod_{s=1}^{S} \prod_{n=1}^{N_s} P(a_{ns} \mid g_{ns})\right) \left(\prod_{s=1}^{S} \prod_{n=1}^{N_s} p(g_{ns} \mid s)\right).$$

Note that these components may be rearranged and rewritten as

$$L_x = \prod_{s=1}^{S} \prod_{n=1}^{N_s} \prod_{i=1}^{I} p_i(x_{ins} \mid a_{ns}, g_{ns}) = \prod_{g} \prod_{a} \prod_{i} \prod_{x} p_i(x \mid a, g)^{n(x, i, a, g)},$$

where $n(x, i, a, g) = \sum_{s} n(x, i, a, g, s)$ is the frequency of category $x$ responses on item $i$
for examinees with skill pattern $a$ in group $g$. Also,

$$L_a = \prod_{s=1}^{S} \prod_{n=1}^{N_s} P(a_{ns} \mid g_{ns}) = \prod_{a} \prod_{g} p(a \mid g)^{n(a; g)},$$

where $n(a; g)$ is the frequency of skill pattern $a$ in group $g$. Finally,

$$L_g = \prod_{s=1}^{S} \prod_{n=1}^{N_s} P(g_{ns} \mid s) = \prod_{s} \prod_{g} p(g \mid s)^{n(g; s)}$$

holds. The $n(g; s)$ represent the frequency of group membership in $g$ in cluster $s$.
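The rearrangement into counts can be illustrated with a few lines of Python. The sketch assumes, hypothetically, that the complete data were available as one (cluster, group, skill pattern) triple per examinee; in practice the group and the skill pattern are unobserved, so these counts are replaced by expected counts during estimation. All names and values are illustrative.

```python
from collections import Counter
import math

# Hypothetical complete data: (cluster s, group g, skill pattern a) per examinee.
complete = [
    (1, 1, (0, 0)), (1, 1, (0, 1)), (1, 2, (1, 1)),
    (2, 2, (1, 1)), (2, 2, (1, 0)), (2, 1, (0, 0)),
]

# Sufficient statistics n(g; s) and n(a; g).
n_gs = Counter((g, s) for s, g, a in complete)
n_ag = Counter((a, g) for s, g, a in complete)

def log_L_g(p_g_given_s):
    """log L_g = sum_s sum_g n(g; s) * log p(g | s); p_g_given_s[s][g] = p(g | s)."""
    return sum(count * math.log(p_g_given_s[s][g])
               for (g, s), count in n_gs.items())

# Made-up class sizes per cluster.
p_g_given_s = {1: {1: 2 / 3, 2: 1 / 3}, 2: {1: 1 / 3, 2: 2 / 3}}
loglik_from_counts = log_L_g(p_g_given_s)
loglik_per_examinee = sum(math.log(p_g_given_s[s][g]) for s, g, a in complete)
```

Both ways of computing $\log L_g$ agree, which is the point of the rearrangement: the likelihood depends on the data only through the frequency tables.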

5.2 Estimation of Cluster-Skill Distributions With the EM Algorithm 

Since unobserved latent variables are involved, the EM algorithm (Dempster, Laird, & Rubin,
1977) is a convenient choice for estimating GDMs (von Davier, in press-a) as well as MGDMs (von
Davier, in press-b) and HGDMs. The EM algorithm cycles through the generation of expected
values and the maximization of parameters given these preliminary expectations until convergence
is reached. For details on this algorithm, refer to McLachlan and Krishnan (2000). For the HGDM,
there are three different types of expected values to be generated in the E-step:

1. $\hat{n}_i(x, a, g) = \sum_{s} \sum_{n} 1\{x_{ins} = x\} P(a, g \mid x_{ns}, s)$ is the expected
frequency of response $x$ to item $i$ for examinees with skill pattern $a$ in group $g$, estimated
across clusters and across examinees within clusters

2. $\hat{n}(a, g) = \sum_{s} \sum_{n} P(a, g \mid x_{ns}, s)$ is the expected frequency of skill
pattern $a$ and group $g$, estimated across clusters and across examinees within clusters

3. $\hat{n}(g; s) = \sum_{n} P(g \mid x_{ns}, s)$ is the expected frequency of group $g$ in cluster
$s$, estimated across examinees in that cluster

For the first and second type of the required expected counts, this involves estimating

$$P(a, g \mid x, s) = \frac{P(x, s, a, g)}{\sum_{g} P(x, s, g)} = \frac{P(x \mid a, g) p(a \mid g) p(g \mid s)}{\sum_{g} P(x, s, g)}$$

with

$$P(x, s, g) = \sum_{a} P(x, s, a, g) = \sum_{a} P(x \mid a, g) p(a \mid g) p(g \mid s)$$

for each response pattern $x_{ns}$, for $s = 1, \ldots, S$ and $n = 1, \ldots, N_s$. For the third type
of expected count, use

$$p(g \mid x, s) = \sum_{a} P(a, g \mid x, s),$$

which is equivalent to

$$p(g \mid x, s) = \frac{P(x, s, g)}{\sum_{g} P(x, s, g)} = \frac{\sum_{a} P(x \mid a, g) p(a \mid g) p(g \mid s)}{\sum_{g} \left[\sum_{a} P(x \mid a, g) p(a \mid g) p(g \mid s)\right]}.$$

This last probability then allows one to estimate the class membership $g$ given both the observed
responses $x$ and the known cluster membership $s$. The utility of the clustering variable may be
evaluated in terms of the increase of the maximum a posteriori probabilities $p(g \mid x, s)$ over
$p(g \mid x)$.

If the clustering variable $s$ is informative for the classification $g$, a noticeable increase of the
maximum posterior probabilities should be observed. The improvement should also be seen in
terms of the marginal log-likelihood if $s$ is informative for $g$.
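The posterior probabilities and the second and third type of expected counts can be sketched with array broadcasting. This is an illustration only and not the mdltm implementation: the function name, the array layout, and the random toy input are assumptions, and the final normalisation step is the standard closed-form update for multinomial proportions, which the text does not spell out.

```python
import numpy as np

def e_step(lik, p_a_given_g, p_g_given_s, cluster):
    """One E-step for the HGDM.

    lik          : (N, G, A) array with P(x_n | a, g) for every examinee
    p_a_given_g  : (G, A) array with p(a | g)
    p_g_given_s  : (S, G) array with p(g | s)
    cluster      : (N,) integer array with the cluster index s_n of each examinee
    """
    # Numerator P(x | a, g) p(a | g) p(g | s) for every examinee, group, pattern.
    joint = lik * p_a_given_g[None, :, :] * p_g_given_s[cluster][:, :, None]
    post = joint / joint.sum(axis=(1, 2), keepdims=True)   # P(a, g | x_n, s_n)
    n_ag = post.sum(axis=0)                                # n_hat(a, g), shape (G, A)
    post_g = post.sum(axis=2)                              # P(g | x_n, s_n)
    n_gs = np.zeros_like(p_g_given_s, dtype=float)
    np.add.at(n_gs, cluster, post_g)                       # n_hat(g; s), shape (S, G)
    return post, n_ag, n_gs

# Hypothetical toy input: N = 6 examinees, S = 2 clusters, G = 2 groups, A = 4 patterns.
rng = np.random.default_rng(0)
lik = rng.uniform(0.01, 0.5, size=(6, 2, 4))
p_a_given_g = np.array([[0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]])
p_g_given_s = np.array([[0.7, 0.3], [0.2, 0.8]])
cluster = np.array([0, 0, 0, 1, 1, 1])

post, n_ag, n_gs = e_step(lik, p_a_given_g, p_g_given_s, cluster)
# Assumed M-step for the cluster-specific class sizes: normalise expected counts.
p_g_given_s_new = n_gs / n_gs.sum(axis=1, keepdims=True)
```

Under measurement invariance (Equation 7) the same function applies with `lik` repeated along the group axis. The expected counts in each cluster add up to the cluster size, so very small clusters contribute very little information to their own row of `p_g_given_s_new`.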

6 An Application to TOEFL® iBT Datasets 

Simulated data have advantages: the truth (i.e., the set of generating values) is known,
and comparisons of different levels of model complexity and misspecification can be made on the
basis of known deviations from the true model. The disadvantage is that simulated data are
artificial by origin, so that the impact of model assumptions on model-data fit can only be studied
under often less than realistic settings. The accuracy of parameter recovery using simulated data
has been studied with quite satisfactory results for the GDM by von Davier (in press-a) using
flat item response data with no missing values, and by Xu and von Davier (2006) for sparse
matrix samples of item responses as collected in national and international surveys of educational
outcomes.

The current exposition focuses on the comparison of results based on two administrations of the
TOEFL® iBT. The target of inference is the stability of estimates relating to clustering variables
given by language group. The analyses carried out are independent scalings of two TOEFL iBT
administrations for which Q-matrices were produced. Von Davier (in press-a) pointed out that the
GDM applied to TOEFL data resulted in highly correlated skill variables, and found that a
two-dimensional, two-parameter logistic (2PL) IRT model across reading and listening domains
provided a more parsimonious data description. However, the eight-skill model across reading and
listening domains was the subject of further investigation by TOEFL experts, so that this model is
adopted for the analyses with the hierarchical GDM.

In a first step, the HGDM was compared to the GDM without hierarchical extension, both 
adopting the same Q-matrix based on eight mastery/nonmastery skills for the February and 
November administrations of the TOEFL iBT. The HGDM was estimated according to Equation 
7. In other words, measurement invariance was assumed across mixture components so that only 
the skill distribution could vary across clusters and the response probabilities P(x \ a) depended on 
the skill profile only, not on cluster s or mixture component g. Table 1 shows the skill correlations 
for the February administration, as well as the marginal skill mastery probabilities for the GDM. 
Table 2 shows the same information for the November administration. 

Table 1

Skill Correlations and Marginal Probabilities of Skill Mastery for the February
Administration Based on the Nonhierarchical Eight-Skill General Diagnostic Model
Across 76 Items Assuming Four Listening and Four Reading Skills

             Skill 1  Skill 2  Skill 3  Skill 4  Skill 5  Skill 6  Skill 7  Skill 8
Skill 1        1.00     0.76     0.73     0.80     0.75     0.60     0.69     0.57
Skill 2                 1.00     0.83     0.81     0.65     0.64     0.67     0.58
Skill 3                          1.00     0.75     0.68     0.69     0.70     0.63
Skill 4                                   1.00     0.61     0.55     0.58     0.45
Skill 5                                            1.00     0.79     0.76     0.66
Skill 6                                                     1.00     0.86     0.80
Skill 7                                                              1.00     0.80
Skill 8                                                                       1.00
P(master)      0.63     0.61     0.57     0.69     0.54     0.46     0.49     0.39


The correlations range between 0.67 and 0.86 for skills of the same domain (i.e., among the
four reading or four listening skills) and are slightly lower across the domains, as expected. For
correlations between one of the four reading skills and one of the four listening skills, the range is
0.56 to 0.77. These are still substantial correlations, which is due to the fact that the overall
reading and listening domains themselves are highly correlated. A two-dimensional 2PL IRT model
estimated with the mdltm software (von Davier, 2005) results in estimated correlations between
the reading and listening domains of 0.81 and 0.85 for the two administrations.

Table 2

Skill Correlations and Marginal Probabilities of Skill Mastery for the November
Administration Based on the Nonhierarchical Eight-Skill General Diagnostic Model
Across 76 Items Assuming Four Listening and Four Reading Skills

             Skill 1  Skill 2  Skill 3  Skill 4  Skill 5  Skill 6  Skill 7  Skill 8
Skill 1        1.00     0.79     0.81     0.67     0.62     0.62     0.57     0.59
Skill 2                 1.00     0.86     0.68     0.60     0.61     0.56     0.59
Skill 3                          1.00     0.70     0.64     0.63     0.58     0.60
Skill 4                                   1.00     0.71     0.67     0.77     0.67
Skill 5                                            1.00     0.82     0.78     0.80
Skill 6                                                     1.00     0.85     0.72
Skill 7                                                              1.00     0.86
Skill 8                                                                       1.00
P(master)      0.63     0.62     0.62     0.44     0.48     0.47     0.40     0.43

When estimating the HGDM for the two administrations, the resulting statistics differ from
those from the GDM in two ways. First, there are two skill distributions $P(a \mid c)$ estimated, one
for each of two mixture components $c = 1$ and $c = 2$, representing the largest of the between-cluster
differences (here language group) that can be expected. Then cluster-skill distributions are formed
by a cluster-specific (a language group, $s$) proportion $P(c \mid s)$, or the probability of belonging
to each of these skill distributions. Note that the TOEFL iBT datasets used here are composed of
students from various language groups, and some of these languages are represented by only a few
students. Since every cluster receives a different set of $P(c \mid s)$, one should consider collecting
students representing very small language groups into larger clusters to avoid numerically unstable
estimates of proportions for these small groups.

The log likelihoods for the eight-skill GDM and HGDM are reported in Table 3 together with
the number of estimated parameters and the average log likelihood per observation. Note that
the November administration included a larger number of language groups, some of which were of
rather small size. This led to a larger increase in the number of estimated parameters from GDM
to HGDM for the November administration than for the February administration.

Table 3

Log Likelihood and Number of Parameters for the Eight-Skill General Diagnostic
Model and Hierarchical General Diagnostic Model for Both Administrations

               Log likelihood   Parameters   Average log likelihood
Independence                                        -43.24
FEB GDM          -164435.20        194              -38.83
FEB HGDM         -163883.00        318              -38.70
FEB 2PL2         -160799.26        160              -37.97
FEB H2PL2        -160297.34        251              -37.85
Independence                                        -41.92
NOV GDM          -196009.88        195              -37.44
NOV HGDM         -195480.19        337              -37.34
NOV 2PL2         -191431.64        160              -36.57
NOV H2PL2        -190905.63        269              -36.47


The average log likelihood per response pattern is improved by a small amount when including
the language group as clustering variable. However, compared to the gain from assuming the GDM
rather than independence of all observed variables, the gain in going from GDM to HGDM seems
comparatively small. For comparison, the table also gives the log likelihood, number of parameters,
and average log likelihood per response pattern for the two-dimensional 2PL/GPC model, estimated
as a nonhierarchical model (2PL2) and as a hierarchical model (H2PL2).
As von Davier (in press-a) reported, the two-dimensional 2PL IRT model is a more parsimonious
description of the TOEFL iBT pilot data than the eight-skill model, a result that holds up for both
the February and the November administrations. The eight-skill model, however, is the focus of an
ongoing methods comparison by TOEFL researchers, so it is adopted for subsequent comparisons
between GDM and HGDM here without any comparisons to the two-dimensional 2PL/GPCM
model.
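The size of the two gains can be made explicit with a few lines of arithmetic on the values in Table 3. The snippet below is only a worked illustration; the dictionary layout and the function name are assumptions, and the numbers are copied from the table.

```python
# Values copied from Table 3: (log likelihood, parameters, average log likelihood).
table3 = {
    "FEB": {"independence": -43.24,
            "GDM": (-164435.20, 194, -38.83),
            "HGDM": (-163883.00, 318, -38.70)},
    "NOV": {"independence": -41.92,
            "GDM": (-196009.88, 195, -37.44),
            "HGDM": (-195480.19, 337, -37.34)},
}

def gains(administration):
    """Differences between adjacent models for one administration."""
    row = table3[administration]
    ll_gdm, k_gdm, avg_gdm = row["GDM"]
    ll_hgdm, k_hgdm, avg_hgdm = row["HGDM"]
    return {
        "independence_to_gdm": round(avg_gdm - row["independence"], 2),
        "gdm_to_hgdm": round(avg_hgdm - avg_gdm, 2),
        "extra_parameters": k_hgdm - k_gdm,
        "loglik_gain": round(ll_hgdm - ll_gdm, 2),
    }

feb = gains("FEB")
nov = gains("NOV")
```

For February the average log likelihood improves by 4.41 from independence to the GDM but only by 0.13 from the GDM to the HGDM, at the cost of 124 additional parameters. For November the corresponding values are 4.48, 0.10, and 142.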

Table 4 shows the two resulting marginal skill distributions for the February administration,
and Table 5 shows the same information for the November administration. For both
administrations, the mixture component C1 shows much lower mastery probabilities than
component C2. The mixture component C2 is characterized by high probabilities of mastery of
all eight skills for both administrations. The marginal sizes of the two components
$\pi_{C2,\mathrm{Feb}}$ and $\pi_{C2,\mathrm{Nov}}$ for the two administrations differ somewhat; there
is about 42% in the high proficiency class in November, whereas there is about 51% in February.

Table 4

Marginal Skill Distributions for the Two Mixture Components C1 and C2 in the
February Eight-Skill Hierarchical General Diagnostic Model With Skill Mastery
Probabilities Given and Marginal Sizes of the Mixture Components Are $\pi_{C1} = 0.49$
and $\pi_{C2} = 0.51$

                  Skill 1  Skill 2  Skill 3  Skill 4  Skill 5  Skill 6  Skill 7  Skill 8
P(master | C1)      0.33     0.28     0.16     0.38     0.15     0.05     0.13     0.06
P(master | C2)      0.92     0.93     0.96     0.97     0.94     0.85     0.85     0.72


Table 5

Marginal Skill Distributions for the Two Mixture Components C1 and C2 in the
November Eight-Skill Hierarchical General Diagnostic Model With Skill Mastery
Probabilities Given and Marginal Sizes of the Mixture Components Are $\pi_{C1} = 0.58$
and $\pi_{C2} = 0.42$

                  Skill 1  Skill 2  Skill 3  Skill 4  Skill 5  Skill 6  Skill 7  Skill 8
P(master | C1)      0.40     0.38     0.38     0.16     0.16     0.11     0.04     0.09
P(master | C2)      0.97     0.95     0.94     0.90     0.93     0.98     0.93     0.91


The two mixture components C1 and C2 represent the largest possible differences between
clusters (language groups) in the sample, since each cluster receives an estimate of a proportion
$P(C2 \mid s)$ (and with that, implicitly, $P(C1 \mid s) = 1 - P(C2 \mid s)$) of members estimated to
belong in the high versus low proficiency components C2 and C1. Since the mastery probabilities of
all skills are much higher in C2 compared to C1 for both administrations, this proportion can be
interpreted as the proportion of examinees in each language group who are highly proficient with
respect to the assessment items reflected in the skill definitions. These proportions can be studied
across administrations, so that the variation (or the lack thereof) of the proportion of highly
proficient students in the language groups becomes a target of inference. This target delivers
information about how well-aligned the TOEFL assessment is for the different language groups
represented in the sample.

Figure 1 shows the proportion of students falling in the high performing class for the November
and February administrations. The figure contains only those language groups for which at least
10 students were observed for each administration of the TOEFL. It can be seen that the class
sizes vary across administrations but are relatively stable when languages are compared. For
example, the proportion of students in the high proficiency class is smaller for a Chinese (CHI)
language background than for a French (FRE) language background (see the appendix for the
language-specific class sizes). The correlation between the two high proficiency class-size
estimates across these 37 language groups is 0.787. When a weighted correlation (with weights
defined as the geometric mean of the two language-group-specific sample sizes, one for each
administration) across all 116 language groups is calculated, the correlation between the
class-size estimates is 0.89.
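A weighted correlation of this kind can be sketched as follows. The language-specific class sizes and sample sizes are reported in the appendix and are not reproduced here, so the five data points below are purely hypothetical; only the weighting scheme (geometric mean of the two sample sizes) follows the description in the text. The function name is an assumption.

```python
import numpy as np

def weighted_corr(x, y, w):
    """Pearson correlation of x and y with nonnegative observation weights w."""
    x, y, w = (np.asarray(v, dtype=float) for v in (x, y, w))
    w = w / w.sum()                                  # normalise weights
    mx, my = np.sum(w * x), np.sum(w * y)            # weighted means
    cov = np.sum(w * (x - mx) * (y - my))            # weighted covariance
    var_x = np.sum(w * (x - mx) ** 2)
    var_y = np.sum(w * (y - my) ** 2)
    return cov / np.sqrt(var_x * var_y)

# Hypothetical high proficiency class sizes and sample sizes for five language groups.
p_feb = np.array([0.25, 0.60, 0.45, 0.80, 0.35])
p_nov = np.array([0.30, 0.55, 0.50, 0.70, 0.20])
n_feb = np.array([900, 150, 40, 25, 12])
n_nov = np.array([1200, 180, 35, 30, 15])

weights = np.sqrt(n_feb * n_nov)                     # geometric mean of sample sizes
r_weighted = weighted_corr(p_feb, p_nov, weights)
r_unweighted = weighted_corr(p_feb, p_nov, np.ones_like(p_feb))
```

With equal weights the function reproduces the ordinary Pearson correlation. Weighting by sample size reduces the influence of small language groups, whose class-size estimates are the least stable.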

The consistency of the language-group-proportion estimates and the substantial correlation
of these estimates across the two administrations are evident from Figure 1. For estimates of the
skill-mastery probabilities of language groups, the $P(C2 \mid s)$ and the mixture-component skill
probabilities can be combined, resulting in

$$P(a \mid s) = \sum_{c=C1}^{C2} P(a \mid c) P(c \mid s)$$

for the language-group-specific skill distribution. As an illustration, the marginal skill mastery
probabilities for the November and February administrations have been calculated for the CHI and
Spanish (SPA) language groups. Table 6 shows the language-group-specific marginal skill mastery
probabilities for CHI and SPA for the two administrations. It can be seen that the skill mastery
probabilities range between 0.54 and 0.69 for the listening skills in the Spanish language sample
and between 0.40 and 0.58 for the Chinese language sample for the November administration. For
the reading skills, the mastery probabilities range between 0.32 and 0.41 for the Chinese language
sample and between 0.49 and 0.55 for the Spanish language sample.
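The mixing step can be checked numerically for the November administration. The snippet below combines the component-specific mastery probabilities from Table 5 with the component proportions for CHI and SPA listed in Table 6. Because the formula is linear in $P(a \mid c)$, it applies to the marginal mastery probability of each skill as well as to the full skill distribution. Variable and function names are illustrative assumptions.

```python
import numpy as np

# November mastery probabilities per mixture component (Table 5).
p_master = np.array([
    [0.40, 0.38, 0.38, 0.16, 0.16, 0.11, 0.04, 0.09],   # C1
    [0.97, 0.95, 0.94, 0.90, 0.93, 0.98, 0.93, 0.91],   # C2
])

# November component proportions P(c | s) per language group (Table 6).
p_c_given_s = {
    "CHI": np.array([0.68, 0.32]),
    "SPA": np.array([0.49, 0.51]),
}

def group_mastery(language_group):
    """Marginal mastery per skill: sum_c P(master | c) * P(c | s)."""
    return p_c_given_s[language_group] @ p_master

chi_nov = np.round(group_mastery("CHI"), 2)
spa_nov = np.round(group_mastery("SPA"), 2)
```

Rounded to two decimals, the results agree with the November rows P(SKILL | CHI) and P(SKILL | SPA) of Table 6.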

Figure 1 Plot of the high proficiency class-size correspondence across two administrations
of the TOEFL iBT based on 37 language groups for which sample sizes exceeded 10 in both
administrations.

It is important to note that the language-group proportions as well as the estimates of
skill-mastery probabilities will vary somewhat over the administrations, even though the ordering
of language-group-specific mastery estimates may stay stable. The estimates presented here are
based on 4 + 4 skills with high correlations within the reading and listening domains as well as
across. Therefore, a similar analysis may be tried with a model that joins the four postulated
skills per domain into one overarching dimension by estimating a two-dimensional model instead.
However, for the purpose of providing statistics on skill mastery for ongoing TOEFL research, it
was necessary in the current study to use the expert-generated eight-skill matrix. As a result, the
language-group-specific profiles of skill mastery will, due to the nature of the highly correlated
skills, mostly reflect overall differences in the proficiency level of the applicant samples across
language groups.


Table 6 

Language-Group-Specific Mastery Probabilities Exemplified Using the November and 
February Administrations Based on Mixing Components and the Chinese and 
Spanish Language Groups 

CHI and SPA in Nov            Skill 1  Skill 2  Skill 3  Skill 4  Skill 5  Skill 6  Skill 7  Skill 8
C1: 0.68 (CHI), 0.49 (SPA)       0.40     0.38     0.38     0.16     0.16     0.11     0.04     0.09
C2: 0.32 (CHI), 0.51 (SPA)       0.97     0.95     0.94     0.90     0.93     0.98     0.93     0.91
P(SKILL | CHI)                   0.58     0.56     0.56     0.40     0.41     0.39     0.32     0.35
P(SKILL | SPA)                   0.69     0.67     0.67     0.54     0.55     0.55     0.49     0.51

CHI and SPA in Feb            Skill 1  Skill 2  Skill 3  Skill 4  Skill 5  Skill 6  Skill 7  Skill 8
C1: 0.79 (CHI), 0.33 (SPA)       0.33     0.28     0.16     0.38     0.15     0.05     0.13     0.06
C2: 0.21 (CHI), 0.67 (SPA)       0.92     0.93     0.96     0.97     0.94     0.85     0.85     0.72
P(SKILL | CHI)                   0.45     0.42     0.33     0.50     0.32     0.22     0.28     0.20
P(SKILL | SPA)                   0.63     0.61     0.57     0.68     0.55     0.46     0.50     0.40


7 Conclusions 

This paper introduces a hierarchical version of the GDM (von Davier, in press-a) and shows 
the effect of clustering through a comparison of results from two administrations of the TOEFL 
iBT assessment when estimating language-group-specific proficiencies. The HGDM provides 
reliable estimates of proportions of high proficiency across language groups. The correlation of the 
estimates is 0.78 for the 37 largest language groups not weighted by sample size, and it increases 
to 0.89 when all language groups that are present in both administrations are weighted according 
to their pooled sample size. 
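
The kind of consistency check described above can be sketched in Python from the rows listed in the Appendix. This is a minimal illustration under stated assumptions: `weighted_pearson` is a hypothetical helper using the standard weighted Pearson formula, which may differ from the exact computation used in the study, and the weighted figure reported in the text is based on all language groups present in both administrations, so a weighted value computed from the Appendix rows alone need not match it.

```python
import math

# (language, N_FEB, P(C2 | FEB), N_NOV, P(C2 | NOV)), copied from the Appendix.
rows = [
    ("CHI", 609, 0.2046, 657, 0.3185), ("KOR", 604, 0.2148, 832, 0.2094),
    ("FRE", 467, 0.7818, 357, 0.6850), ("GER", 458, 0.9067, 433, 0.8412),
    ("ITA", 378, 0.6154, 331, 0.4629), ("SPA", 294, 0.6712, 483, 0.5125),
    ("JPN", 245, 0.2847, 410, 0.1344), ("ARA", 119, 0.3010, 187, 0.1382),
    ("TGL", 82, 0.5033, 111, 0.3377), ("RUS", 74, 0.7192, 136, 0.5592),
    ("TEL", 60, 0.6326, 43, 0.5539), ("POR", 59, 0.7922, 73, 0.5308),
    ("ENG", 58, 0.5389, 66, 0.5206), ("THA", 48, 0.1178, 91, 0.1669),
    ("HIN", 48, 0.7168, 70, 0.7504), ("TUR", 48, 0.3000, 76, 0.3310),
    ("FAS", 43, 0.4183, 58, 0.2420), ("GUJ", 37, 0.1997, 40, 0.3770),
    ("VIE", 33, 0.0883, 92, 0.2681), ("RUM", 32, 0.8560, 35, 0.5611),
    ("URD", 29, 0.3433, 43, 0.5167), ("POL", 27, 0.8082, 58, 0.4816),
    ("IND", 27, 0.1886, 45, 0.3673), ("TAM", 21, 0.6182, 26, 0.6456),
    ("BEN", 19, 0.4665, 19, 0.6505), ("BUL", 19, 0.7412, 19, 0.5115),
    ("HEB", 19, 0.8938, 28, 0.6925), ("MAL", 18, 0.8636, 25, 0.6301),
    ("UKR", 15, 0.4151, 15, 0.3913), ("ALB", 14, 0.4363, 18, 0.6206),
    ("CZE", 13, 0.6163, 13, 0.3456), ("IBO", 12, 0.8651, 13, 0.4731),
    ("PAN", 11, 0.3585, 15, 0.3225), ("N/A", 11, 0.7544, 20, 0.6363),
    ("YOR", 10, 0.9075, 11, 0.7347), ("AMH", 10, 0.1595, 23, 0.2089),
]


def weighted_pearson(x, y, w):
    """Pearson correlation of x and y with non-negative weights w."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(var_x * var_y)


feb = [r[2] for r in rows]
nov = [r[4] for r in rows]

# Equal weights give the ordinary (unweighted) Pearson correlation.
unweighted = weighted_pearson(feb, nov, [1.0] * len(rows))
# Assumption: "pooled sample size" is taken as N(FEB) + N(NOV) per group.
pooled = weighted_pearson(feb, nov, [r[1] + r[3] for r in rows])

print(f"unweighted r = {unweighted:.2f}")
print(f"weighted r (pooled N) = {pooled:.2f}")
```

Weighting by pooled sample size gives large groups such as CHI and KOR more influence and down-weights the noisier estimates from groups with only 10 to 20 test takers.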

If the clustering is informative, as it seems to be in the TOEFL case, the prediction of 
proficiency can potentially be improved, as seen in the slight increase of average log-likelihood 
(see Table 3). The clustering, or language-group membership in the analyses presented here, acts 
as ancillary information, so that the fit of the HGDM to the observed cognitive item responses 
can be compared to models without a clustering variable. The results presented here indicate 
that a mixture of different class-specific skill distributions is a useful tool in conjunction with 
cluster-specific mixing proportions to model the dependency of skill distribution on a clustering 
variable. The approach estimates conditional skill distributions across the whole sample, which, in 
the parameterization chosen, represent different expected skill profiles in unknown subpopulations 
of a mixture distribution. The cluster-specific mixing proportions then estimate the composition 
of the clusters—here language groups—based on the assumption that the mixture-distribution 
subpopulations are represented in varying levels across clusters. In this example, the mixture 
components turned out to be ordered proficiency classes, due to the nature of the eight skills 
applied, which are known to be substantially correlated. 

The estimated proportions, more specifically the variance of these proportions across 
clusters, and the consistency of identified proportions across administrations can provide valuable 
information about the sources of proficiency variation in hierarchically organized data. The HGDM 
provides a tool to study such variations in the context of item response models, latent class models, 
and diagnostic models for profile scoring. 

Appendix 

Proportions of High Proficiency Class Membership by Country for 
the February and November Administrations of the TOEFL iBT for 37 
Language Groups for Which Sample Sizes Exceeded 10 in Both Administrations 

Lang.   N(FEB)   P(C2 | FEB)   N(NOV)   P(C2 | NOV)
CHI        609        0.2046      657        0.3185
KOR        604        0.2148      832        0.2094
FRE        467        0.7818      357        0.6850
GER        458        0.9067      433        0.8412
ITA        378        0.6154      331        0.4629
SPA        294        0.6712      483        0.5125
JPN        245        0.2847      410        0.1344
ARA        119        0.3010      187        0.1382
TGL         82        0.5033      111        0.3377
RUS         74        0.7192      136        0.5592
TEL         60        0.6326       43        0.5539
POR         59        0.7922       73        0.5308
ENG         58        0.5389       66        0.5206
THA         48        0.1178       91        0.1669
HIN         48        0.7168       70        0.7504
TUR         48        0.3000       76        0.3310
FAS         43        0.4183       58        0.2420
GUJ         37        0.1997       40        0.3770
VIE         33        0.0883       92        0.2681
RUM         32        0.8560       35        0.5611
URD         29        0.3433       43        0.5167
POL         27        0.8082       58        0.4816
IND         27        0.1886       45        0.3673
TAM         21        0.6182       26        0.6456
BEN         19        0.4665       19        0.6505
BUL         19        0.7412       19        0.5115
HEB         19        0.8938       28        0.6925
MAL         18        0.8636       25        0.6301
UKR         15        0.4151       15        0.3913
ALB         14        0.4363       18        0.6206
CZE         13        0.6163       13        0.3456
IBO         12        0.8651       13        0.4731
PAN         11        0.3585       15        0.3225
N/A         11        0.7544       20        0.6363
YOR         10        0.9075       11        0.7347
AMH         10        0.1595       23        0.2089